library("tidyverse")
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   1.0.1 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.3.0      ✔ stringr 1.5.0 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library("tidymodels")
## ── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──
## ✔ broom        1.0.3     ✔ rsample      1.1.1
## ✔ dials        1.1.0     ✔ tune         1.0.1
## ✔ infer        1.0.4     ✔ workflows    1.1.2
## ✔ modeldata    1.1.0     ✔ workflowsets 1.0.0
## ✔ parsnip      1.0.3     ✔ yardstick    1.1.0
## ✔ recipes      1.0.4     
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter()   masks stats::filter()
## ✖ recipes::fixed()  masks stringr::fixed()
## ✖ dplyr::lag()      masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step()   masks stats::step()
## • Search for functions across packages at https://www.tidymodels.org/find/
library("tidylog")
## 
## Attaching package: 'tidylog'
## 
## The following object is masked from 'package:rsample':
## 
##     gather
## 
## The following objects are masked from 'package:dplyr':
## 
##     add_count, add_tally, anti_join, count, distinct, distinct_all,
##     distinct_at, distinct_if, filter, filter_all, filter_at, filter_if,
##     full_join, group_by, group_by_all, group_by_at, group_by_if,
##     inner_join, left_join, mutate, mutate_all, mutate_at, mutate_if,
##     relocate, rename, rename_all, rename_at, rename_if, rename_with,
##     right_join, sample_frac, sample_n, select, select_all, select_at,
##     select_if, semi_join, slice, slice_head, slice_max, slice_min,
##     slice_sample, slice_tail, summarise, summarise_all, summarise_at,
##     summarise_if, summarize, summarize_all, summarize_at, summarize_if,
##     tally, top_frac, top_n, transmute, transmute_all, transmute_at,
##     transmute_if, ungroup
## 
## The following objects are masked from 'package:tidyr':
## 
##     drop_na, fill, gather, pivot_longer, pivot_wider, replace_na,
##     spread, uncount
## 
## The following object is masked from 'package:stats':
## 
##     filter
# Reading in the Walmart dataset:

dfw <- read_csv("data/walmart.csv") %>% 
  rename_with(tolower) %>% 
  arrange(store)
## Rows: 6435 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl  (7): Store, Temperature, Fuel_Price, CPI, Unemployment, Size, Weekly_Sales
## lgl  (1): IsHoliday
## date (1): Date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## rename_with: renamed 9 variables (store, date, isholiday, temperature, fuel_price, …)
# An overview of the data

head(dfw)
# The structure of the data and each column's data type. Note that isholiday is a logical (i.e., TRUE/FALSE) predictor variable

str(dfw)
## spc_tbl_ [6,435 × 9] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ store       : num [1:6435] 1 1 1 1 1 1 1 1 1 1 ...
##  $ date        : Date[1:6435], format: "2010-04-16" "2012-04-06" ...
##  $ isholiday   : logi [1:6435] FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ temperature : num [1:6435] 66.3 70.4 87.2 42.3 84.8 ...
##  $ fuel_price  : num [1:6435] 2.81 3.89 2.63 2.57 3.57 ...
##  $ cpi         : num [1:6435] 210 221 212 211 222 ...
##  $ unemployment: num [1:6435] 7.81 7.14 7.79 8.11 6.91 ...
##  $ size        : num [1:6435] 151315 151315 151315 151315 151315 ...
##  $ weekly_sales: num [1:6435] 1105515 1505325 837329 1112467 1085133 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Store = col_double(),
##   ..   Date = col_date(format = ""),
##   ..   IsHoliday = col_logical(),
##   ..   Temperature = col_double(),
##   ..   Fuel_Price = col_double(),
##   ..   CPI = col_double(),
##   ..   Unemployment = col_double(),
##   ..   Size = col_double(),
##   ..   Weekly_Sales = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

Simple Linear Regression Model

We will begin by running a simple linear model that regresses weekly sales onto the Consumer Price Index (CPI).

# Specifying our model type and setting the computational engine

linear_model <- 
  linear_reg() %>% 
  set_engine("lm")

# Fitting the model

fit_cpi <- 
  linear_model %>% 
  fit(weekly_sales ~ cpi, data = dfw)
  
# Model output 

summary(fit_cpi$fit)
## 
## Call:
## stats::lm(formula = weekly_sales ~ cpi, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -662386 -318443  -73868  258442 2095880 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 827280.5    21778.4  37.986  < 2e-16 ***
## cpi           -732.7      123.7  -5.923 3.33e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 390600 on 6433 degrees of freedom
## Multiple R-squared:  0.005423,   Adjusted R-squared:  0.005269 
## F-statistic: 35.08 on 1 and 6433 DF,  p-value: 3.332e-09
In this model, a Walmart store facing a (purely theoretical) CPI of 0 can expect weekly sales of ~$827,280. We also observe that the relationship between weekly_sales and cpi is negative. That is, if CPI increases by one unit, weekly sales will decrease by ~$733; and if CPI decreases by one unit, sales will increase by ~$733.
In evaluating the model statistics, we see an adjusted R-squared of 0.005269. In other words, this model explains only roughly 0.5% of the variance in Walmart's weekly sales. So, while our interpretation of the effect of CPI on weekly_sales is still valid, we must conclude that this model fails to explain the variance in our target variable.
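The fit statistics quoted above can also be pulled programmatically rather than read off the printed summary: broom's glance() returns them as a one-row tibble. A minimal sketch using the built-in mtcars data (rather than dfw, so it runs standalone):

```r
library(broom)

# glance() collects the model-level statistics (r.squared,
# adj.r.squared, sigma, statistic, p.value, ...) into a one-row tibble
fit <- lm(mpg ~ wt, data = mtcars)
glance(fit)$adj.r.squared   # ~0.745
```

On our models, the same call would be glance(fit_cpi$fit), which makes it easy to line up adjusted R-squared values across candidate models in a single table.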
# Now we will plot the effect of CPI on sales for a few different stores in the dataset, starting with store 10.

plot_store_10 <- 
  dfw %>% 
  filter(store == 10) %>% 
  ggplot(aes(x = cpi, y = weekly_sales)) +
    geom_point() +
    geom_smooth(method = "lm") +
    labs(title = 'Weekly Sales vs. CPI for Store 10', x = 'Consumer Price Index', y = 'Weekly Sales (USD)') +
  theme_minimal()
## filter: removed 6,292 rows (98%), 143 rows remaining
plotly::ggplotly(plot_store_10)
## `geom_smooth()` using formula = 'y ~ x'
plot_store_11 <- 
  dfw %>% 
  filter(store == 11) %>% 
  ggplot(aes(x = cpi, y = weekly_sales)) +
    geom_point() +
    geom_smooth(method = "lm") +
    labs(title = 'Weekly Sales vs. CPI for Store 11', x = 'Consumer Price Index', y = 'Weekly Sales (USD)') +
  theme_minimal()
## filter: removed 6,292 rows (98%), 143 rows remaining
plotly::ggplotly(plot_store_11)
## `geom_smooth()` using formula = 'y ~ x'
plot_store_12 <- 
  dfw %>% 
  filter(store == 12) %>% 
  ggplot(aes(x = cpi, y = weekly_sales)) +
    geom_point() +
    geom_smooth(method = "lm") +
    labs(title = 'Weekly Sales vs. CPI for Store 12', x = 'Consumer Price Index', y = 'Weekly Sales (USD)') +
  theme_minimal()
## filter: removed 6,292 rows (98%), 143 rows remaining
plotly::ggplotly(plot_store_12)
## `geom_smooth()` using formula = 'y ~ x'
plot_store_13 <- 
  dfw %>% 
  filter(store == 13) %>% 
  ggplot(aes(x = cpi, y = weekly_sales)) +
    geom_point() +
    geom_smooth(method = "lm") +
    labs(title = 'Weekly Sales vs. CPI for Store 13', x = 'Consumer Price Index', y = 'Weekly Sales (USD)') +
  theme_minimal()
## filter: removed 6,292 rows (98%), 143 rows remaining
plotly::ggplotly(plot_store_13)
## `geom_smooth()` using formula = 'y ~ x'
# A plot to demonstrate how the relationship between CPI and sales varies by
# region/store. Note that the smoothed line is negative in some locales and
# positive in others.

animated_plot <- 
    dfw %>% 
    filter(store %in% c(11:15)) %>% 
    ggplot(aes(x = cpi, y = weekly_sales)) + 
    geom_point() + 
    geom_smooth(method = lm) +
    labs(title = 'Weekly Sales vs. CPI for Store {closest_state}', x = 'Consumer Price Index', y = 'Weekly Sales (USD)') +
    theme_minimal() + 
    gganimate::transition_states(store, transition_length = 1, state_length = 2) +
    gganimate::view_follow()
## filter: removed 5,720 rows (89%), 715 rows remaining
animated_plot
## `geom_smooth()` using formula = 'y ~ x'

What we observe here is that the impact of CPI can vary greatly by store/region. This still aligns with our evaluation of fit_cpi: that particular model explained only a small fraction (~0.5%) of the variance in weekly_sales, so we would expect to see these kinds of swings. With a (much) higher adjusted R-squared, these variations would look unusual.
# Fitting a separate regression for each store and extracting its cpi coefficient

dfw %>%
  group_by(store) %>%
  group_modify(~ tidy(lm(weekly_sales ~ ., data = .x))) %>%
  filter(term == "cpi")
## group_by: one grouping variable (store)
## filter (grouped): removed 315 rows (88%), 45 rows remaining
# Filtering for 2012 and plotting CPI against Weekly_Sales

plot <- dfw %>%
  filter(lubridate::year(date) == 2012) %>%
  ggplot(aes(x = cpi, y = weekly_sales)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(title = 'Weekly Sales vs. CPI in 2012', x = 'Consumer Price Index', y = 'Weekly Sales (USD)') +
  theme_minimal()
## filter: removed 4,500 rows (70%), 1,935 rows remaining
plotly::ggplotly(plot)
## `geom_smooth()` using formula = 'y ~ x'
We see an interesting effect when we filter for one specific year. The clusters are nearly vertical because CPI is calculated geographically, at the Core-Based Statistical Area (CBSA) or Metropolitan Statistical Area (MSA) level. CPI can be identical across a region while the different stores in that region post very different sales volumes, hence the vertical clusters.
# Now let's look exclusively at store 10

plot_store_cpi <- dfw %>%
  filter(store == 10, lubridate::year(date) == 2012) %>%
  ggplot(aes(x = cpi, y = weekly_sales)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(title = 'Weekly Sales vs. CPI for Store 10 in 2012', x = 'Consumer Price Index', y = 'Weekly Sales (USD)') +
  theme_minimal()
## filter: removed 6,392 rows (99%), 43 rows remaining
plotly::ggplotly(plot_store_cpi)
## `geom_smooth()` using formula = 'y ~ x'
Although CPI varies by region, the deviation in CPI across time for a single region tends to be much lower, which is why we see such a slim range here. Since CPI is a measure of inflation, we expect to see these regional effects.
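The claim that CPI barely moves within a single store's region over time, while differing a lot across regions, can be checked with a group-wise standard deviation. A toy sketch of the pattern (the store and cpi values below are made up; on the real data you would group dfw the same way):

```r
library(dplyr)

# hypothetical data: two stores in regions with very different price levels
toy <- tibble::tibble(
  store = rep(c(1, 2), each = 3),
  cpi   = c(210.1, 210.4, 210.9,   # store 1's region
            126.1, 126.3, 126.6)   # store 2's region
)

# within-store spread (well under 1 point) is tiny relative
# to the ~84-point gap between the two stores
toy %>%
  group_by(store) %>%
  summarise(cpi_sd = sd(cpi), .groups = "drop")
```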
# A new iteration of the previous model that also includes store Size as an independent variable

fit_cpi_size <- 
  linear_model %>% 
  fit(weekly_sales ~ cpi + size, data = dfw)

summary(fit_cpi_size$fit)
## 
## Call:
## stats::lm(formula = weekly_sales ~ cpi + size, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -563750 -167145  -29612  112172 1912650 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.828e+05  1.497e+04  12.216   <2e-16 ***
## cpi         -6.570e+02  7.692e+01  -8.542   <2e-16 ***
## size         4.847e+00  4.796e-02 101.048   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 242800 on 6432 degrees of freedom
## Multiple R-squared:  0.6156, Adjusted R-squared:  0.6155 
## F-statistic:  5151 on 2 and 6432 DF,  p-value: < 2.2e-16
# Comparing fit_cpi to fit_size to see which is better at explaining the variance in Weekly_Sales

anova(fit_cpi$fit, fit_cpi_size$fit)
The model that includes size as a predictor variable (fit_cpi_size) performs significantly better than fit_cpi. The adjusted R-squared shows the model now explains ~62% of the variance in weekly sales, and the ANOVA test confirms that including size yields a statistically significant improvement.
tidy(fit_cpi$fit)
tidy(fit_cpi_size$fit)
Note also that the cpi coefficient in the revised model has been reduced in magnitude from ~$733 to ~$657. This is because size now explains some of the variance that was left unexplained by the previous model, which only included CPI.
# Building a model that uses all variables EXCEPT Date and Store

fit_full <- 
  linear_model %>% 
  fit(weekly_sales ~ . - store - date, data = dfw)

summary(fit_full$fit)
## 
## Call:
## stats::lm(formula = weekly_sales ~ . - store - date, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -557148 -165608  -24125  112851 1918479 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    3.133e+05  3.546e+04   8.834  < 2e-16 ***
## isholidayTRUE  6.012e+04  1.196e+04   5.026 5.14e-07 ***
## temperature    1.002e+03  1.739e+02   5.761 8.72e-09 ***
## fuel_price    -1.333e+04  6.822e+03  -1.954   0.0507 .  
## cpi           -9.461e+02  8.445e+01 -11.203  < 2e-16 ***
## unemployment  -1.252e+04  1.725e+03  -7.258 4.40e-13 ***
## size           4.840e+00  4.802e-02 100.786  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 241200 on 6428 degrees of freedom
## Multiple R-squared:  0.621,  Adjusted R-squared:  0.6206 
## F-statistic:  1755 on 6 and 6428 DF,  p-value: < 2.2e-16
anova(fit_cpi_size$fit, fit_full$fit)
We observe a further, though slight, improvement in the adjusted R-squared of the new model (fit_full), which uses every remaining predictor while excluding the store and date identifiers. The ANOVA test also confirms that the improvement in explanatory power is statistically significant.

More Linear Regression

We hypothesize that the effect of good weather is increased on holidays. We can test this by revising fit_full and including an interaction term.
fit_full_int <- 
  linear_model %>% 
  fit(weekly_sales ~ . - store - date + isholiday * temperature, data = dfw)

summary(fit_full_int$fit)
## 
## Call:
## stats::lm(formula = weekly_sales ~ . - store - date + isholiday * 
##     temperature, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -557499 -165415  -24493  112914 1918376 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                3.148e+05  3.565e+04   8.830  < 2e-16 ***
## isholidayTRUE              4.745e+04  3.265e+04   1.453   0.1462    
## temperature                9.809e+02  1.808e+02   5.424 6.04e-08 ***
## fuel_price                -1.342e+04  6.826e+03  -1.966   0.0493 *  
## cpi                       -9.460e+02  8.446e+01 -11.200  < 2e-16 ***
## unemployment              -1.251e+04  1.725e+03  -7.254 4.53e-13 ***
## size                       4.840e+00  4.802e-02 100.779  < 2e-16 ***
## isholidayTRUE:temperature  2.473e+02  5.932e+02   0.417   0.6768    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 241200 on 6427 degrees of freedom
## Multiple R-squared:  0.621,  Adjusted R-squared:  0.6206 
## F-statistic:  1504 on 7 and 6427 DF,  p-value: < 2.2e-16
anova(fit_full$fit, fit_full_int$fit)
Although the interaction coefficient in fit_full_int is positive, hinting at a slightly stronger temperature effect on holidays, it is not statistically significant (p ≈ 0.68), and the ANOVA test shows no statistically significant improvement over fit_full. We cannot assert that the model with the interaction term is an improvement.
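For completeness, here is what the interaction term would mean if it were reliable: with a dummy-coded interaction, the temperature slope on holiday weeks is the main effect plus the interaction coefficient. Plugging in the estimates from the summary above:

```r
# coefficient estimates copied from summary(fit_full_int$fit)
temp_slope  <- 980.9   # temperature (9.809e+02)
interaction <- 247.3   # isholidayTRUE:temperature (2.473e+02)

# implied temperature slope during holiday weeks
holiday_slope <- temp_slope + interaction
holiday_slope   # ~1,228 dollars of weekly sales per degree
```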
We’ll also test whether the effect of temperature on weekly sales is linear by squaring that variable.
fit_full_sq <- 
  linear_model %>% 
  fit(weekly_sales ~ . - store - date + I(temperature ^2), data = dfw)

summary(fit_full_sq$fit)
## 
## Call:
## stats::lm(formula = weekly_sales ~ . - store - date + I(temperature^2), 
##     data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -561455 -165260  -24674  112058 1911166 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       2.610e+05  4.111e+04   6.350 2.30e-10 ***
## isholidayTRUE     6.230e+04  1.199e+04   5.197 2.09e-07 ***
## temperature       3.294e+03  9.301e+02   3.542   0.0004 ***
## fuel_price       -1.471e+04  6.841e+03  -2.151   0.0315 *  
## cpi              -9.547e+02  8.449e+01 -11.300  < 2e-16 ***
## unemployment     -1.253e+04  1.724e+03  -7.268 4.09e-13 ***
## size              4.831e+00  4.811e-02 100.420  < 2e-16 ***
## I(temperature^2) -1.982e+01  7.901e+00  -2.509   0.0121 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 241100 on 6427 degrees of freedom
## Multiple R-squared:  0.6214, Adjusted R-squared:  0.621 
## F-statistic:  1507 on 7 and 6427 DF,  p-value: < 2.2e-16
anova(fit_full$fit, fit_full_sq$fit)
# Plotting the fitted quadratic relationship between temperature and weekly sales

dfw %>%
  ggplot(aes(x = temperature, y = weekly_sales)) + 
  geom_smooth(method = "lm", formula = y ~ x + I(x^2))

The model output demonstrates a curvilinear, or inverted U-shaped, relationship (visualized above). People are less likely to shop retail on a freezing cold day. Increasing temperatures are associated with increased sales, but only to a point: as temperatures become excessive and dangerous, sales start to decrease.
If we were managing Walmart's promotions, we could offer larger discounts when the weather is at either extreme and perhaps even increase the price of certain products when the temperature is mild.
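The turning point of that inverted U can be computed from the quadratic's vertex: for a fit containing b1*temperature + b2*temperature^2, predicted sales peak at temperature = -b1 / (2 * b2). Plugging in the estimates from summary(fit_full_sq$fit):

```r
b1 <- 3294     # temperature coefficient (3.294e+03)
b2 <- -19.82   # I(temperature^2) coefficient (-1.982e+01)

# temperature (degrees F) at which predicted weekly sales peak,
# holding the other predictors fixed
peak_temp <- -b1 / (2 * b2)
peak_temp   # ~83 degrees F
```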

Predictive Analytics

Now that we have a model that is fairly robust, we will use it to make predictions of weekly sales revenue.
# Setting seed for reproducibility

set.seed(3.14159)

# Splitting the data set into a training dataset (75%) and a test dataset (25%)

dfw_split <- initial_split(dfw)

dfw_train <-  training(dfw_split)

dfw_test <- testing(dfw_split)
# Fitting the model

fit_org <- 
  linear_model %>% 
  fit(weekly_sales ~ . - date - store + I(temperature^2), data = dfw_train)

summary(fit_org$fit)
## 
## Call:
## stats::lm(formula = weekly_sales ~ . - date - store + I(temperature^2), 
##     data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -557260 -165114  -25112  115048 1913671 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       2.546e+05  4.725e+04   5.389 7.42e-08 ***
## isholidayTRUE     6.038e+04  1.397e+04   4.323 1.57e-05 ***
## temperature       3.056e+03  1.068e+03   2.861  0.00424 ** 
## fuel_price       -1.939e+04  7.819e+03  -2.480  0.01316 *  
## cpi              -9.217e+02  9.640e+01  -9.561  < 2e-16 ***
## unemployment     -1.058e+04  1.992e+03  -5.312 1.14e-07 ***
## size              4.826e+00  5.496e-02  87.809  < 2e-16 ***
## I(temperature^2) -1.628e+01  9.058e+00  -1.797  0.07237 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 239000 on 4818 degrees of freedom
## Multiple R-squared:  0.6248, Adjusted R-squared:  0.6242 
## F-statistic:  1146 on 7 and 4818 DF,  p-value: < 2.2e-16
# The linear regression output as a tibble

tidy(fit_org)
# Creating a new dataframe with predicted values

results_org <-
  predict(fit_org, new_data = dfw_test) %>% 
  bind_cols(dfw_test) %>% 
  rename(Predicted_Sales = .pred)
## rename: renamed one variable (Predicted_Sales)
results_org %>% 
  arrange(date)
# Defining the metric set we will be working with to evaluate the models

perf_metrics <- metric_set(rmse, mae)
# Calculating the performance of fit

perf_metrics(results_org, truth =  weekly_sales, estimate = Predicted_Sales)
These metrics indicate that our model is off by ~$240,424 according to RMSE and ~$179,092 according to MAE. These numbers appear alarming until one recalls that weekly sales by Walmart store location range from about $70k to $2.8 million, with a mean of $740k and a median of $689k.
# Building the model without the I(temperature^2) term, using only the training data set

fit_org_nosq <-
  linear_model %>% 
  fit(weekly_sales ~ . - store - date, data = dfw_train)
summary(fit_org_nosq$fit)
## 
## Call:
## stats::lm(formula = weekly_sales ~ . - store - date, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -553813 -165778  -24194  114613 1919644 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    2.981e+05  4.060e+04   7.344 2.42e-13 ***
## isholidayTRUE  5.846e+04  1.393e+04   4.197 2.76e-05 ***
## temperature    1.170e+03  1.998e+02   5.857 5.02e-09 ***
## fuel_price    -1.842e+04  7.802e+03  -2.360   0.0183 *  
## cpi           -9.137e+02  9.632e+01  -9.486  < 2e-16 ***
## unemployment  -1.056e+04  1.992e+03  -5.302 1.20e-07 ***
## size           4.832e+00  5.486e-02  88.094  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 239000 on 4819 degrees of freedom
## Multiple R-squared:  0.6245, Adjusted R-squared:  0.624 
## F-statistic:  1336 on 6 and 4819 DF,  p-value: < 2.2e-16
# Comparing the models

anova(fit_org$fit, fit_org_nosq$fit)
# Creating a new dataframe, results_org_nosq, with predictions

results_org_nosq <-
  predict(fit_org_nosq, new_data = dfw_test) %>% 
  bind_cols(dfw_test) %>% 
  rename(Predicted_Sales = .pred)
## rename: renamed one variable (Predicted_Sales)
# Calculating performance metrics

perf_metrics(results_org_nosq, truth =  weekly_sales, estimate = Predicted_Sales)
When we remove the squared temperature term, something notable occurs. First, our adjusted R-squared diminishes slightly (from 0.6242 to 0.6240), making this a slightly less appealing model in terms of explaining the variance in weekly sales. However, we also observe that the prediction error has been reduced, making fit_org_nosq relatively superior in terms of predictive capability. Since we are trying to build a reliable predictive model, we exclude the term and conclude that fit_org_nosq is better for that purpose.

More Predictive Modeling

We are fairly pleased with both the explanatory and predictive power of fit_org_nosq, but of course we would like to improve upon both metrics. One issue that we have not yet discussed is the variability in weekly sales across Walmart locations, as shown below.
# Calculating total weekly sales per store

sales_by_store <- aggregate(weekly_sales ~ store, data = dfw, sum)

# A bar chart showing the distribution of weekly sales revenue by store

ggplot(sales_by_store, aes(x = store, y = weekly_sales)) +
  geom_bar(stat = "identity", fill = "#0078D4") +
  labs(title = "Total Weekly Sales by Store", x = "Store Number", y = "Total Weekly Sales")

Viewing the bar chart, we can see that weekly sales vary enormously across stores, which suggests that rescaling the weekly_sales variable could improve our model. One way to do this is to transform each value to its natural logarithm, which stabilizes the variance across stores; we'll use the log() function to accomplish this.
# Log model with all other variables unchanged
dfw_log <-
  dfw %>% 
  mutate(log_sales = log(weekly_sales))
## mutate: new variable 'log_sales' (double) with 6,435 unique values and 0% NA
dfw_log
set.seed(3.14159)

dfwlog_split <- initial_split(dfw_log)
dfwlog_train <- training(dfwlog_split)
dfwlog_test <- testing(dfwlog_split)
fit_log <- 
  linear_model %>% 
  fit(log_sales ~ . - store - date - weekly_sales, data=dfwlog_train)

summary(fit_log$fit)
## 
## Call:
## stats::lm(formula = log_sales ~ . - store - date - weekly_sales, 
##     data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.27631 -0.22829 -0.01924  0.22924  1.48007 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    1.247e+01  5.575e-02 223.594  < 2e-16 ***
## isholidayTRUE  6.378e-02  1.913e-02   3.334 0.000863 ***
## temperature    4.764e-04  2.744e-04   1.736 0.082580 .  
## fuel_price    -7.008e-03  1.072e-02  -0.654 0.513151    
## cpi           -1.185e-03  1.323e-04  -8.958  < 2e-16 ***
## unemployment  -4.849e-03  2.736e-03  -1.772 0.076401 .  
## size           8.095e-06  7.534e-08 107.444  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3283 on 4819 degrees of freedom
## Multiple R-squared:  0.7109, Adjusted R-squared:  0.7106 
## F-statistic:  1975 on 6 and 4819 DF,  p-value: < 2.2e-16
This final model uses a log-linear regression to explain weekly Walmart sales using store-level, economic, and seasonal predictors. Applying a log transformation to weekly sales substantially improves model performance, yielding an adjusted R² of 0.71, and stabilizes variance across stores with vastly different revenue scales. Overall model fit is strong, with well-behaved residuals and a highly significant F-statistic, indicating that the included predictors jointly explain a meaningful share of sales variation.
Results show that store size is the dominant driver of weekly sales, dwarfing macroeconomic effects and confirming that physical scale largely determines revenue potential. Holiday weeks are associated with an average 6–7% increase in sales, validating the importance of seasonal demand spikes. Inflation, proxied by CPI, has a small but statistically significant negative relationship with sales, even after controlling for store characteristics. Temperature and unemployment exhibit modest effects consistent with economic intuition, while fuel prices do not appear to meaningfully impact sales once other factors are accounted for.
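The 6-7% holiday figure follows from the standard interpretation of a dummy variable in a log-linear model: a coefficient b corresponds to a percentage effect of exp(b) - 1, not b itself (though the two are close for small b):

```r
holiday_coef <- 0.06378   # isholidayTRUE estimate from summary(fit_log$fit)

# exact percentage interpretation of a dummy in a log-linear model
pct_effect <- exp(holiday_coef) - 1
pct_effect   # ~0.066: holiday weeks lift sales by about 6.6%
```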
This specification represents the best balance between interpretability, explanatory power, and robustness among the models tested. The log transformation enables clear percentage-based interpretations while materially improving fit relative to linear alternatives, making the model suitable for both analytical insight and downstream forecasting.
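One caveat for the forecasting use case: fit_log predicts log dollars, and simply exponentiating those predictions understates the conditional mean on the original scale. A common remedy is Duan's smearing estimator, which rescales the naive back-transform by the average of the exponentiated residuals. A sketch with hypothetical prediction and residual values (on the real model you would use predict(fit_log, new_data = ...) and the training-set residuals):

```r
# hypothetical log-scale predictions and training residuals
log_pred <- c(13.2, 13.8, 14.1)
resid    <- c(-0.30, 0.10, 0.20)

# naive back-transform: biased low for the conditional mean
naive <- exp(log_pred)

# Duan's smearing factor: mean of the exponentiated residuals
smear <- mean(exp(resid))
corrected <- naive * smear   # smear > 1 here, so predictions scale up
```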

Limitations

- The model does not explicitly account for store-level fixed effects or regional hierarchies, which may mask persistent location-specific dynamics.
- Temporal structure is handled implicitly; autocorrelation and seasonality are not directly modeled.
- The analysis assumes linear relationships on the log scale and may understate nonlinear or interaction effects beyond those tested.
- CPI and unemployment are measured at broader geographic levels and may not fully capture local economic conditions.

Next Steps

- Implement mixed-effects (hierarchical) models to capture store-specific variation.
- Explore time-series approaches for improved short-term forecasting.
- Incorporate promotional data, foot traffic, or local demographic variables to enhance predictive accuracy.